In statisticiansix/demonstrandum:

knitr::opts_chunk$set(
    echo = FALSE,
    fig.height = 4.5,
    fig.width = 8,
    message = FALSE,
    warning = FALSE
)
library(DML)
library(tidyverse)
library(magrittr)
library(Rtsne)
library(gridExtra)
library(cluster)
library(factoextra)

results <- params$results

Analysis

We are going to cluster the data provided with k-Means, using r ncol(attr(results,'args')$x)-1 r ifelse((ncol(attr(results,'args')$x)-1)==1,'column','columns') to r ifelse(!is.null(attr(results,'args')$optimal_k_method),sprintf('try to find the most appropriate number of clusters using the "%s" method.',attr(results,'args')$optimal_k_method),sprintf('identify %s clusters.',attr(results,'args')$k)) First, however, we will perform an exploratory analysis of the data.

Exploratory Analysis

We start by exploring the data to check the assumptions that k-Means clustering requires in order to produce a suitable outcome:

1) Variables should be continuous.
2) All variables should have the same variance.
3) The data should be spherically distributed.
4) It is reasonable to expect the cluster sizes should be equal.
5) It is expected that there is a set of clusters within the data.

continuousCheck <- classify_columns(attr(results,'args')$x %>% select(-Cluster))%>%filter(!grepl('continuous',classification))
if(nrow(continuousCheck)>1){
  continuousText <-sprintf("we find that there are %s variables (%s) that are potentially not continuous in nature and as such this might affect the quality of the final clustering solution.",nrow(continuousCheck),collapse_and(continuousCheck$column))

}else if(nrow(continuousCheck)==1){
  if((ncol(attr(results,'args')$x)-1)!=1){
    continuousText <- sprintf('we find one variable (%s) that is potentially not continuous and as such this might affect the quality of the final clustering solution.',continuousCheck$column)
  }else{
    continuousText <- 'we find that the variable we are using to cluster is not continuous, so it is highly unlikely that k-Means will produce a suitable clustering; we would be better served by a visual inspection to identify clusters in this case.'
  }
}else{
  if((ncol(attr(results,'args')$x)-1)!=1){
    continuousText <- "we see that the variables that we are using are all continuous."
  }else{
    continuousText <- "we find that the variable we are using is continuous. Care should be taken, however, as using a single variable in k-Means might not produce a suitable clustering; we may be better served by manual segmentation methods."
  }
}

First we will do a quick data check to ensure all our variables are continuous. When we look at the data r continuousText r ifelse(attr(results,'args')$scale,"Within the clustering we have scaled the data in an attempt to stabilise the variance; as such, we shall plot the scaled data in the rest of our assessments.",'')

r ifelse((ncol(attr(results,'args')$x)-1)>2,'As we have more than 2 variables within our clustering we cannot visually assess the sphericity of the data set; we will have to bear this in mind when assessing the suitability of our data.',ifelse((ncol(attr(results,'args')$x)-1)==2,'As we are clustering using 2 variables we can visually inspect the sphericity of our data using the scatter plot below. We can also use this plot to visually assess whether we see any clusters within the data.','As we are clustering with one variable we cannot visually assess the sphericity of the data points.'))

attr(results,'raw_plot')

In order to assess the last two assumptions in the list above we will use a Visual Assessment of cluster Tendency (VAT), with statistical tests to confirm what we see r ifelse((ncol(attr(results,'args')$x)-1)<=2,'in the image.','in the image, and a dimensionally reduced plot of our data to see whether we can identify the number of clusters that may be present within our data. Within the dimensionally reduced plot we might not see spherical clusters, but if there are clusters present we should see evidence of this within the plot.')

attr(results,'raw_plot')
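For data with more than two columns, one way to obtain a dimensionally reduced view is sketched below. This is only an illustration on simulated data: it uses principal component analysis via base R's `prcomp`, which is an assumption made here for simplicity and not necessarily the reduction method behind the plot in this report.

```r
# Illustrative sketch only: simulated data, with PCA used as the
# dimension reduction method (an assumption, not the report's method).
set.seed(42)
x <- rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),  # 50 points around (0, 0, 0)
  matrix(rnorm(150, mean = 4), ncol = 3)   # 50 points around (4, 4, 4)
)

# Centre and scale each column, then keep the first two components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
reduced <- pca$x[, 1:2]
dim(reduced)
```

Plotting `reduced` as a scatter plot would then show whether groups of points separate, even though their shape in the reduced space need not be spherical.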

With the VAT, if we have a clusterable data set we would expect to see distinct groups within the Ordered Dissimilarity Image. These would appear in the plot below as distinct blocks, with low dissimilarity within potential clusters and high dissimilarity between them.
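The block structure described here can be illustrated with a minimal sketch. The helper `ordered_dissimilarity` is hypothetical and not part of this report's code; it reorders the distance matrix using a hierarchical clustering order, which is an assumption standing in for the ordering step of the VAT algorithm.

```r
# Hypothetical helper: reorder a dissimilarity matrix so that similar
# observations sit next to each other. The ordering comes from hclust(),
# an assumption used here in place of the VAT ordering itself.
ordered_dissimilarity <- function(x) {
  d <- dist(scale(x))
  ord <- hclust(d)$order
  as.matrix(d)[ord, ord]
}

# Simulated data: two tight, well separated groups of 10 points each.
set.seed(1)
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.2), ncol = 2)
)
odi <- ordered_dissimilarity(x)

# Average dissimilarity inside the first block versus between the blocks.
within_block  <- mean(odi[1:10, 1:10])
between_block <- mean(odi[1:10, 11:20])
c(within = within_block, between = between_block)
```

Passing `odi` to `image()` would draw the matrix, with the two groups showing up as two blocks along the diagonal.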

clusterability <- attr(results,'clusterability')
clusterability

The values of Hopkins' statistic, $H$, lie between 0 and 1. Values close to 1 indicate that the data is clusterable, whereas values of around 0.5 or below indicate that it is not, since a value near 0.5 is what we would expect from uniformly random data. With a value of r attr(clusterability,'metrics')$hopkins the Hopkins' statistic in this case indicates r ifelse(attr(clusterability,'metrics')$hopkins<=0.5,'that the data might not be clusterable.','that the data should be clusterable.')
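To make the statistic more concrete, here is a simplified sketch of how a Hopkins-type statistic can be computed. The function `hopkins_sketch` is hypothetical and is not the implementation used in this report. It assumes the convention in which values near 1 indicate clustered data, and it uses plain nearest neighbour distances without raising them to a power.

```r
# Hypothetical, simplified Hopkins-type statistic (not the report's code).
# Convention assumed: H = sum(u) / (sum(u) + sum(w)), where
#   u = nearest-neighbour distances from uniform random points to the data,
#   w = nearest-neighbour distances from sampled data points to other data.
hopkins_sketch <- function(x, m = ceiling(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)

  # w: for m sampled observations, distance to the nearest other observation.
  idx <- sample(n, m)
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: for m points drawn uniformly from the bounding box of the data,
  # distance to the nearest observation.
  ranges <- apply(x, 2, range)
  u_pts <- sapply(seq_len(ncol(x)),
                  function(j) runif(m, ranges[1, j], ranges[2, j]))
  u_pts <- matrix(u_pts, nrow = m)
  u <- apply(u_pts, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  sum(u) / (sum(u) + sum(w))
}

# Simulated data with two tight, well separated groups.
set.seed(1)
clustered <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)
h <- hopkins_sketch(clustered)
h
```

For strongly clustered data such as this, the uniform points sit far from the observations while the observations sit close to one another, so the ratio is pushed towards 1.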

r ifelse(is.null(attr(results,'k_selection')),'','## k Selection')

optimal_k_intro <- sprintf('We are going to use the "%s" method to identify the optimal number of clusters that we should use.',attr(results,'args')$optimal_k_method)

optimal_k <- sprintf('From this method we have identified that we are going to use %s clusters.',attr(results,'args')$k)

r ifelse(is.null(attr(results,'k_selection')),'',optimal_k_intro)

attr(results,'k_selection')

r ifelse(is.null(attr(results,'k_selection')),'',optimal_k)
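As an illustration of how a number of clusters can be chosen, the sketch below computes the total within-cluster sum of squares for a range of k on simulated data. This is one common approach (often called the elbow method) and is shown here as an assumption; the method actually used in this report is whichever one is named above.

```r
# Illustrative sketch only: simulated data with three groups.
set.seed(123)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6.
# nstart = 10 reruns k-Means from several random starts and keeps the best.
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
round(wss, 1)
```

The value of k after which `wss` stops dropping sharply is a candidate for the number of clusters.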

Results

We have identified r attr(results,'args')$k clusters using r collapse_and(sprintf('"%s"',colnames(attr(results,'args')$x %>% select(-Cluster)))).

attr(results,'plot')

The cluster sizes are:

clustering_sizes <- attr(results,'data') %>%
  .['Cluster'] %>%
  count(Cluster) %>%
  rename('Size'='n') %>%
  mutate('%'=paste0(round(Size/sum(Size),2)*100,'%')) %>%
  select(Cluster,`%`,Size) %>%
  mutate('Text'=sprintf('%s (%s)',Size,`%`))
knitr::kable(clustering_sizes %>% select(-Text))

Next Steps

Now that we have obtained our clusters we should interpret them, taking into account how we intend to use them and the initial question of interest that we wanted this analysis to answer.

Confirming suitability of clustering

Now that we have the clusters we should identify their suitability for further use and analysis. We can do so by investigating the actionability that these clusters have and how that might impact future work. If we cannot use these clusters to derive an action then we should investigate what might be causing the issues we are seeing and try remedial actions. Whilst we can use further metrics to quantify the final clustering, we favour subjective evaluation, as such metrics do not provide proof or evidence of whether a better solution exists.
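As an example of the kind of metric mentioned here, the sketch below computes the average silhouette width for a k-Means solution on simulated data, using `silhouette` from the cluster package. The choice of metric is an assumption made for illustration; it is not a metric that this report calculates.

```r
library(cluster)

# Illustrative sketch only: simulated data with two groups.
set.seed(7)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2)
)
fit <- kmeans(x, centers = 2, nstart = 10)

# Silhouette widths lie between -1 and 1; larger values mean an observation
# sits closer to its own cluster than to the nearest other cluster.
sil <- silhouette(fit$cluster, dist(x))
mean_width <- mean(sil[, "sil_width"])
mean_width
```

A high average width describes how well separated the clusters are, but it says nothing about whether they are actionable, which is why subjective evaluation is still needed.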

What to do if the clustering isn't suitable

We should first check whether our data fails to meet any of the requirements or assumptions for k-Means clustering. These are listed below, together with some potential remedial actions (should they exist).

1) Variables should be continuous.

We could look at using factor analysis to transform the variables into combinations of values that are continuous in nature.

2) All variables should have the same variance.

We could scale our data to standardise the variance of our variables.
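A minimal sketch of this step, using base R's `scale` on simulated data, is shown below. The column names `income` and `age` are hypothetical and chosen only to show two variables on very different scales.

```r
# Illustrative sketch only: two hypothetical variables on different scales.
set.seed(99)
x <- cbind(
  income = rnorm(100, mean = 30000, sd = 8000),
  age    = rnorm(100, mean = 40, sd = 12)
)

# Before scaling the variances differ by several orders of magnitude.
apply(x, 2, var)

# scale() centres each column and divides it by its standard deviation,
# so every column ends up with variance 1.
scaled <- scale(x)
apply(scaled, 2, var)
```

Without this step the variable with the larger variance would dominate the Euclidean distances that k-Means relies on.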

3) The data should be spherically distributed.

We will need to investigate other clustering methods where this assumption does not exist or investigate data transformations that could make our data spherical.

4) It is reasonable to expect the cluster sizes should be equal.

We will need to investigate other clustering methods where this assumption does not exist.

5) It is expected that there is a set of clusters within the data.

We should look at other analytical methods: if there are no clusters to be found in the data, cluster analysis cannot identify them.

What can we use these clusters for?

We can use these segments to create groups for further analysis, either through descriptive analyses (customer segmentation etc.) or in further work such as modelling or predictive analysis.

Summary

In this report we detailed how we performed a k-Means clustering on a data set with n columns and m observations, which we standardised to satisfy the recommended criteria for the algorithm. We created r attr(results,'args')$k clusters ranging from r clustering_sizes %>% filter(Size==min(Size)) %>% .[['Text']] to r clustering_sizes %>% filter(Size==max(Size)) %>% .[['Text']] in size.

\pagebreak